LegalWebAgent: Empowering Access to Justice via LLM-Based Web Agents
Tan, Jinzhe, Benyekhlef, Karim
Access to justice remains a global challenge: many citizens still find it difficult to seek help from the justice system when facing legal issues. Although the internet provides abundant legal information and services, navigating complex websites, understanding legal terminology, and filling out procedural forms continue to pose barriers to access. This paper introduces LegalWebAgent, a framework that employs a web agent powered by multimodal large language models to bridge this gap for ordinary citizens. The framework combines the natural language understanding capabilities of large language models with multimodal perception, enabling a complete process from user query to concrete action. It operates in three stages: the Ask Module interprets user needs through natural language processing; the Browse Module autonomously navigates webpages, interacts with page elements (including forms and calendars), and extracts information from HTML structures and webpage screenshots; and the Act Module synthesizes information for users or performs direct actions such as form completion and schedule booking. To evaluate its effectiveness, we designed a benchmark of 15 real-world tasks simulating typical legal service processes relevant to Québec civil law users, from problem identification to procedural operations. Evaluation results show LegalWebAgent achieved a peak success rate of 86.7%, with an average of 84.4% across all tested models, demonstrating high autonomy in complex real-world scenarios.
- North America > Canada > Quebec > Montreal (0.05)
- North America > Canada > Alberta > Census Division No. 13 > Westlock County (0.04)
- North America > Canada > Alberta > Census Division No. 11 > Sturgeon County (0.04)
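The Ask/Browse/Act pipeline described in the abstract can be sketched as a simple staged loop. This is a minimal illustration under assumed names (`AgentState` and stub stage functions are hypothetical), not the paper's actual implementation; a real system would back each stage with a multimodal LLM and a browser driver.

```python
from dataclasses import dataclass, field

@dataclass
class AgentState:
    """Carries information between the three stages (hypothetical structure)."""
    query: str
    intent: str = ""
    observations: list = field(default_factory=list)
    result: str = ""

def ask(state: AgentState) -> AgentState:
    # Ask Module: turn the user's free-text query into a structured intent.
    # A real system would call a multimodal LLM here; this is a stub.
    state.intent = f"resolve: {state.query.lower()}"
    return state

def browse(state: AgentState) -> AgentState:
    # Browse Module: navigate pages and collect evidence from HTML/screenshots.
    state.observations.append(f"page content relevant to '{state.intent}'")
    return state

def act(state: AgentState) -> AgentState:
    # Act Module: synthesize observations into an answer (or perform an action).
    state.result = "; ".join(state.observations)
    return state

def run_pipeline(query: str) -> str:
    state = AgentState(query=query)
    for stage in (ask, browse, act):
        state = stage(state)
    return state.result
```

The point of the staged structure is that each module has a single, inspectable input/output contract, which is what lets the Browse stage be swapped between HTML parsing and screenshot-based perception.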
The Effect of Enforcing Fairness on Reshaping Explanations in Machine Learning Models
Anderson, Joshua Wolff, Visweswaran, Shyam
Trustworthy machine learning in healthcare requires strong predictive performance, fairness, and explainability. While it is known that improving fairness can affect predictive performance, little is known about how fairness improvements influence explainability, an essential ingredient for clinical trust. Clinicians may hesitate to rely on a model whose explanations shift after fairness constraints are applied. In this study, we examine how enhancing fairness through bias mitigation techniques reshapes Shapley-based feature rankings. We quantify changes in feature importance rankings after applying fairness constraints across three datasets: pediatric urinary tract infection risk, direct anticoagulant bleeding risk, and recidivism risk. We also evaluate the stability of Shapley-based rankings across multiple model classes. We find that increasing model fairness across racial subgroups can significantly alter feature importance rankings, sometimes in different ways across groups. These results highlight the need to consider accuracy, fairness, and explainability jointly in model assessment rather than in isolation.
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- Oceania > Guam (0.04)
- North America > United States > Alaska (0.04)
- (2 more...)
- Research Report > Experimental Study (0.69)
- Research Report > New Finding (0.68)
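The ranking-shift analysis the abstract describes can be illustrated with a toy computation: derive feature ranks from absolute Shapley importances before and after mitigation, then compare the two rankings with Kendall's tau. The feature names and importance values below are invented for illustration; the paper's actual features and metrics may differ.

```python
from itertools import combinations

def rank_map(importances: dict) -> dict:
    """Map feature -> rank (0 = most important) from |Shapley| importances."""
    ordered = sorted(importances, key=lambda f: -abs(importances[f]))
    return {f: r for r, f in enumerate(ordered)}

def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Kendall rank correlation between two rankings of the same features."""
    feats = list(rank_a)
    concordant = discordant = 0
    for f, g in combinations(feats, 2):
        s = (rank_a[f] - rank_a[g]) * (rank_b[f] - rank_b[g])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(feats) * (len(feats) - 1) / 2
    return (concordant - discordant) / n_pairs

# Hypothetical mean |SHAP| values before and after bias mitigation.
before = {"age": 0.40, "creatinine": 0.25, "race": 0.20, "bmi": 0.05}
after  = {"age": 0.35, "creatinine": 0.30, "bmi": 0.15, "race": 0.02}
tau = kendall_tau(rank_map(before), rank_map(after))
```

A tau of 1.0 means the ranking is unchanged; values well below 1.0 flag exactly the explanation drift that may erode clinical trust.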
RAG System for Supporting Japanese Litigation Procedures: Faithful Response Generation Complying with Legal Norms
Ishihara, Yuya, Keyaki, Atsushi, Yamada, Hiroaki, Ohara, Ryutaro, Sumida, Mihoko
This study discusses the essential components that a Retrieval-Augmented Generation (RAG)-based LLM system should possess in order to support Japanese medical litigation procedures complying with legal norms. In litigation, expert commissioners, such as physicians, architects, accountants, and engineers, provide specialized knowledge to help judges clarify points of dispute. When considering the substitution of these expert roles with a RAG-based LLM system, the constraint of strict adherence to legal norms is imposed. Specifically, three requirements arise: (1) the retrieval module must retrieve appropriate external knowledge relevant to the disputed issues in accordance with the principle prohibiting the use of private knowledge, (2) the responses generated must originate from the context provided by the RAG and remain faithful to that context, and (3) the retrieval module must reference external knowledge with appropriate timestamps corresponding to the issues at hand. This paper discusses the design of a RAG-based LLM system that satisfies these requirements.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.15)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
- North America > United States (0.04)
- (4 more...)
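Requirements (2) and (3) above lend themselves to a small sketch: a retriever restricted to sources dated on or before the events in dispute, and a crude lexical faithfulness check on the generated answer. This is an assumed toy design, not the system proposed in the paper; `Doc`, `retrieve`, and `is_faithful` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    text: str
    published: date

def retrieve(corpus: list, query_terms: set, as_of: date, k: int = 3) -> list:
    """Return the k docs most overlapping the query, restricted to those
    published on or before the date of the disputed events (requirement 3)."""
    eligible = [d for d in corpus if d.published <= as_of]
    scored = sorted(eligible,
                    key=lambda d: -len(query_terms & set(d.text.lower().split())))
    return scored[:k]

def is_faithful(answer: str, context_docs: list) -> bool:
    """Crude lexical faithfulness check (requirement 2): every word of the
    answer must appear somewhere in the retrieved context."""
    context = " ".join(d.text.lower() for d in context_docs)
    return all(w in context for w in answer.lower().split())
```

The timestamp filter mirrors the legal constraint that a 2020 clinical guideline cannot be used to judge the standard of care in a 2014 case; the faithfulness check mirrors the prohibition on private knowledge leaking into the response.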
Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data
Khatri, Mann, Yusuf, Mirza, Shah, Rajiv Ratn, Kumaraguru, Ponnurangam
Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.
- North America > Canada > Alberta > Census Division No. 13 > Westlock County (0.24)
- North America > Canada > Alberta > Census Division No. 11 > Sturgeon County (0.24)
- Europe > United Kingdom (0.14)
- (8 more...)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.71)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.05)
- North America > United States > California (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- (4 more...)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > California (0.04)
- North America > Canada > Alberta > Census Division No. 13 > Westlock County (0.04)
- (4 more...)
- Research Report > Experimental Study (1.00)
- Research Report > Strength High (0.94)
- Government (1.00)
- Health & Medicine (0.93)
- Law (0.68)
- North America > Canada > Alberta > Census Division No. 13 > Westlock County (0.14)
- Asia > Myanmar > Tanintharyi Region > Dawei (0.04)
- North America > United States > Pennsylvania (0.04)
- (4 more...)
- Health & Medicine > Therapeutic Area > Neurology (0.54)
- Health & Medicine > Therapeutic Area > Ophthalmology/Optometry (0.41)
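The first intervention in the abstract above, reorganizing a judgment by rhetorical role, can be sketched as grouping labeled sentences into contiguous sections. The role labels and ordering below are hypothetical; actual rhetorical-role schemas for Indian judgment datasets differ.

```python
from collections import defaultdict

# Hypothetical role schema; real datasets use finer-grained labels.
ROLE_ORDER = ["FACTS", "ARGUMENTS", "REASONING", "DECISION"]

def reorganize(sentences: list) -> str:
    """Group (role, sentence) pairs under role headers in a fixed order, so the
    model sees structurally contiguous blocks instead of interleaved text."""
    by_role = defaultdict(list)
    for role, sent in sentences:
        by_role[role].append(sent)
    sections = []
    for role in ROLE_ORDER:
        if by_role[role]:
            sections.append(role + ":\n" + "\n".join(by_role[role]))
    return "\n\n".join(sections)
```

Structured input of this kind is one plausible mechanism behind the reported F1 gains: the model no longer has to infer which sentences serve which function while also processing a long context.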
Thinker: Training LLMs in Hierarchical Thinking for Deep Search via Multi-Turn Interaction
Xu, Jun, Du, Xinkai, Ao, Yu, Zhao, Peilong, Li, Yang, Zhong, Ling, Yuan, Lin, Bo, Zhongpu, Wang, Xiaorui, Sun, Mengshu, Gui, Zhengke, Zhang, Dalong, Wang, Zhaoyang, Wang, Qiwei, Hou, Yangyang, Yin, Zhiying, Wang, Haofen, Chen, Huajun, Liang, Lei, Zhou, Jun
Efficient retrieval of external knowledge bases and web pages is crucial for enhancing the reasoning abilities of LLMs. Previous works on training LLMs to leverage external retrievers for solving complex problems have predominantly employed end-to-end reinforcement learning. However, these approaches neglect supervision over the reasoning process, making it difficult to guarantee logical coherence and rigor. To address these limitations, we propose Thinker, a hierarchical thinking model for deep search through multi-turn interaction, making the reasoning process supervisable and verifiable. It decomposes complex problems into independently solvable sub-problems, each dually represented in both natural language and an equivalent logical function to support knowledge base and web searches. Concurrently, dependencies between sub-problems are passed as parameters via these logical functions, enhancing the logical coherence of the problem-solving process. To avoid unnecessary external searches, we perform knowledge boundary determination to check if a sub-problem is within the LLM's intrinsic knowledge, allowing it to answer directly. Experimental results indicate that with as few as several hundred training samples, the performance of Thinker is competitive with established baselines. Furthermore, when scaled to the full training set, Thinker significantly outperforms these methods across various datasets and model sizes. The source code is available at https://github.com/OpenSPG/KAG-Thinker.
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- Europe > Austria > Vienna (0.14)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- (15 more...)
- Research Report (0.81)
- Workflow (0.68)
- Leisure & Entertainment (1.00)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine (1.00)
- Media > Film (0.68)
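The decomposition scheme described above, sub-problems whose answers are passed as parameters into dependent logical functions, with a knowledge-boundary check deciding between direct answering and external search, can be sketched as a small dependency resolver. This is an assumed illustration, not the KAG-Thinker implementation; `solve`, the template format, and the sample data are invented.

```python
def solve(subproblems: dict, known: dict, search) -> dict:
    """Resolve sub-problems in dependency order.

    subproblems: name -> (list of dependency names, query template with {} slots)
    known:       intrinsic knowledge; a grounded query found here is answered
                 directly (knowledge-boundary determination)
    search:      fallback callable for queries outside the boundary
    """
    answers = {}

    def resolve(name):
        if name in answers:
            return answers[name]
        deps, template = subproblems[name]
        args = [resolve(d) for d in deps]   # dependency answers as parameters
        query = template.format(*args)      # grounded "logical function" call
        answers[name] = known.get(query) or search(query)
        return answers[name]

    for name in subproblems:
        resolve(name)
    return answers
```

Because every sub-problem is materialized as an explicit grounded query, each step of the chain can be checked independently, which is the sense in which the reasoning process becomes supervisable and verifiable.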
PRBench: Large-Scale Expert Rubrics for Evaluating High-Stakes Professional Reasoning
Akyürek, Afra Feyza, Gosai, Advait, Zhang, Chen Bo Calvin, Gupta, Vipul, Jeong, Jaehwan, Gunjal, Anisha, Rabbani, Tahseen, Mazzone, Maria, Randolph, David, Meymand, Mohammad Mahmoudi, Chattha, Gurshaan, Rodriguez, Paula, Mares, Diego, Singh, Pavit, Liu, Michael, Chawla, Subodh, Cline, Pete, Ogaz, Lucy, Hernandez, Ernesto, Wang, Zihao, Bhatter, Pavi, Ayestaran, Marcos, Liu, Bing, He, Yunzhong
Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. Existing evaluations often fail to assess open-ended, economically consequential tasks in high-stakes domains like Legal and Finance, where practical returns are paramount. To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public rubric-based benchmark for the legal and finance domains. We recruited 182 qualified professionals, each holding a JD, a CFA, or 6+ years of experience, who contributed tasks inspired by their actual workflows. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions. Our expert-curated rubrics are validated through a rigorous quality pipeline, including independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further catalog the economic impacts associated with the prompts and analyze performance using human-annotated rubric categories. Our analysis shows that models with similar overall scores can diverge significantly on specific capabilities. Common failure modes include inaccurate judgments, a lack of process transparency, and incomplete reasoning, highlighting critical gaps in their reliability for professional adoption.
- North America > United States > California (0.04)
- North America > Canada > Alberta > Census Division No. 13 > Westlock County (0.04)
- North America > Canada > Alberta > Census Division No. 11 > Sturgeon County (0.04)
- (2 more...)
- Banking & Finance (1.00)
- Health & Medicine > Government Relations & Public Policy (0.67)
- Law > Litigation (0.46)
- Government > Regional Government > North America Government > United States Government (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)
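Rubric-based scoring of the kind PRBench reports (e.g., 0.39 on the Finance Hard subset) can be sketched as a weighted fraction of satisfied criteria. The `Criterion` structure, weights, and example criteria below are hypothetical; the benchmark's actual grading protocol may weight and aggregate differently.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    description: str
    weight: float
    met: bool  # judged by an expert or a grader model; fixed here for illustration

def rubric_score(criteria: list) -> float:
    """Weighted fraction of rubric criteria satisfied, in [0, 1]."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    return sum(c.weight for c in criteria if c.met) / total
```

A score of 0.39 under such a scheme means that, weight for weight, models satisfy well under half of what a qualified professional would check for.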